BMM-Based Chinese Word Segmentor with Word Support Model for the SIGHAN Bakeoff 2006

Author

  • Jia-Lin Tsai
Abstract

This paper describes a Chinese word segmentor (CWS) for the third International Chinese Language Processing Bakeoff (SIGHAN Bakeoff 2006). We participated in the word segmentation task in the Microsoft Research (MSR) closed testing track. Our CWS is based on backward maximum matching (BMM) with a word support model (WSM) and context-based Chinese unknown word identification. The scored results and our own experiments show that the WSM improves our previous CWS, reported at the SIGHAN Bakeoff 2005, by about 1% in F-measure.
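As a rough illustration of the backward maximum matching step named in the abstract, the sketch below scans a sentence from right to left and greedily takes the longest lexicon entry ending at the current position. It is only a minimal sketch under stated assumptions: the function name `bmm_segment`, the `max_len` cap, and the toy lexicon are hypothetical, and the word support model and unknown word identification described in the paper are not modeled here.

```python
def bmm_segment(text, lexicon, max_len=4):
    """Segment text right to left, taking the longest lexicon match each time.

    `lexicon` is a set of known words; `max_len` caps the candidate length.
    Both are illustrative assumptions, not values from the paper.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    # Words were collected from the end of the sentence, so restore order.
    words.reverse()
    return words


# Hypothetical toy lexicon, chosen only to show the right-to-left behaviour.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(bmm_segment("研究生命起源", toy_lexicon))
```

With this toy lexicon, scanning from the right yields 研究 / 生命 / 起源, whereas a left-to-right longest match would commit to 研究生 first and leave 命 stranded as a single character.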

Similar resources

Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff

This paper describes a Chinese word segmentor (CWS) based on the backward maximum matching (BMM) technique for the 2nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS includes a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is generated fully automatically from the MSR training corpus. According to...

POC-NLW Template for Chinese Word Segmentation

In this paper, a language tagging template named POC-NLW (position of a character within an n-length word) is presented. Based on this template, a two-stage statistical model for Chinese word segmentation is constructed. In this method, the basic word segmentation is based on an n-gram language model, and a Hidden Markov tagger based on the POC-NLW template is used to implement the out-of-vocabular...

A Pragmatic Chinese Word Segmentation System

This paper presents our work for the Third International Chinese Word Segmentation Bakeoff. We apply several processing approaches according to the corresponding sub-tasks, which are exhibited in real natural language. In our system, a trigram model with a smoothing algorithm is the core module in word segmentation, and a Maximum Entropy model is the basic model in Named Entity Recog...

Word Boundary Token Model for the SIGHAN Bakeoff 2007

This paper describes a Chinese word segmentation system based on a word boundary token model and a triple template matching model for extracting unknown words, and on a word support model for resolving segmentation ambiguity.

Character Language Models for Chinese Word Segmentation and Named Entity Recognition

We describe the application of the LingPipe toolkit (Alias-i 2006) to Chinese word segmentation and named entity recognition. We provide results for the third SIGHAN Chinese language processing bakeoff (Levow 2006). F1 measures on the best performing corpora were .972 for word segmentation and .855 for person/location/organization named-

Publication date: 2006